Analyzing Fine-grained Hypertext Features for Enhanced Crawling and Topic Distillation
Authors
Abstract
Early Web search engines closely resembled Information Retrieval (IR) systems, which had matured over several decades. Around 1996–1999, it became clear that the spontaneous formation of hyperlink communities in the Web graph had much to offer Web search, leading to a flurry of research on hyperlink-based ranking of query responses. In this paper we show that, over and above inter-page hyperlinks, much semantic information can be teased out of the manner in which markup tags, such as menu-bars, tables, and lists, are used to organize pages, and of the context in which hyperlinks are made from one page to another. We believe that delving into fine-grained page structure will be the next wave in hypertext mining, bridging some of the gap between unstructured HTML and relatively structured XML, and setting the stage for deeper analysis of page snippets. We discuss two applications to illustrate fine-grained page analysis: topic-guided focused crawling, and hyperlink-based ranking that takes page structure into account. In both cases, we show that exploiting fine-grained structure enhances the performance of our systems.
Similar resources
1st International Workshop on Web Dynamics
After crawling and keyword indexing, the next wave that has made a significant impact on Web search is topic distillation: analyzing properties of the hyperlink graph for enhanced ranking of Web pages in response to a query. Hyperlink induced topic search (HITS) and PageRank (used in Google) are two examples. The linear algebra involved in HITS and PageRank is standard, but selecting the releva...
Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling
Topic-focused Web crawling aims to harness the potential of the Internet reliably and efficiently, producing topic specific indexes of pages within the Web. Previous work has focused on supplying suitably general descriptions of topics to generate large general indexes. In this paper we propose a method that uses lexical profiling of a corpus that consists of hierarchical structures in existing...
INFSCI 2910 - Independent Study: Foundations (Fall Term
With the fast growth of online educational content, the abundance of quality material opens new opportunities to learners. For practically any domain, a learner can easily find tens, hundreds or even thousands of web pages, tutorials and electronic textbooks on the Internet. Modern educational systems can grasp this opportunity offering alternative content to their users by automatic linking simil...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural-language sentences, creating a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...
Journal title:
- IEEE Data Eng. Bull.
Volume 25, Issue
Pages -
Publication date 2002